Abstract

The Dirichlet prior is widely used in estimating discrete distributions and functionals of discrete distributions. In terms of
Shannon entropy estimation, one approach is to plug the Dirichlet prior smoothed distribution into the entropy functional, while
the other is to calculate the Bayes estimator for entropy under the Dirichlet prior for squared error, which is the conditional
expectation. We show that in general they do not improve over the maximum likelihood estimator, which plugs the empirical
distribution into the entropy functional. No matter how we tune the parameters in the Dirichlet prior, this approach cannot achieve
the minimax rates in entropy estimation, as recently characterized by Jiao, Venkat, Han, and Weissman [1], and Wu and Yang [2].
The performance of the minimax rate-optimal estimator with $n$ samples is essentially at least as good as that of the Dirichlet
smoothed entropy estimators with $n \ln n$ samples.

We harness the theory of approximation using positive linear operators for analyzing the bias of plug-in estimators for general 
functionals under arbitrary statistical models, thereby further consolidating the interplay between these two fields, which was 
thoroughly developed and exploited by Jiao, Venkat, Han, and Weissman [3]. We establish new results in approximation theory,
and apply them to analyze the bias of the Dirichlet prior smoothed plug-in entropy estimator. This interplay between bias analysis 
and approximation theory is of relevance and consequence far beyond the specific problem setting in this paper. 

Index Terms 

entropy estimation, approximation theory using positive linear operators, Dirichlet prior smoothing, minimax optimality, 
maximum risk, high dimensional statistics, large alphabet 


I. Introduction 

The Shannon entropy of a discrete distribution, defined by 

$$H(P) \triangleq \sum_{i=1}^{S} p_i \ln \frac{1}{p_i}, \tag{1}$$

emerged in Shannon's 1948 masterpiece [4] in the answers to the most fundamental questions of compression and communication.
In addition to its prominent roles in information theory, Shannon entropy has been widely applied in such disciplines
as genetics [5], image processing [6], computer vision [7], secrecy [8], ecology [9], and physics [10]. In most real-world
applications, the true distribution underlying the generation of the data is unknown, making it impossible for us to compute
the Shannon entropy exactly. Hence, in nearly every practical problem involving Shannon entropy, we need to estimate it from
data.

Classical theory is mainly concerned with the case where the number of i.i.d. samples $n \to \infty$, while the alphabet size
$S$ is fixed. In that scenario, the maximum likelihood estimator (MLE), which plugs the empirical distribution into the
definition of entropy, is asymptotically efficient [11, Thm. 8.11, Lemma 8.14] in the sense of the Hájek convolution theorem
[12] and the Hájek-Le Cam local asymptotic minimax theorem [13]. In contrast, various modern data-analytic applications
deal with datasets which do not fall into the regime of fixed $S$ and $n \to \infty$. In fact, in many applications the alphabet size $S$
is comparable to or even larger than the number of samples $n$. Indeed, the Wikipedia page on Chinese characters shows
that the alphabet of the Chinese language is at least 80,000 characters large. As another case in point, half of the words in
the Shakespearean canon appeared only once [14].

The problem of entropy estimation in the large alphabet regime (or non-asymptotic analysis) has been investigated extensively
in various disciplines. We refer to [1] for a thorough review. One recent breakthrough in this direction came from Valiant and
Valiant [15], who constructed the first explicit entropy estimator of sample complexity $n \asymp S/\ln S$, which they also proved to be
necessary. On the other hand, it was shown in [16], [3] that the MLE requires $n \asymp S$ samples, implying that the MLE is strictly
sub-optimal in terms of sample complexity.

However, the aforementioned estimators have not been shown to achieve the minimax $L_2$ rates. In light of this, Jiao et al.
[1], and Wu and Yang in [2] independently developed schemes based on approximation theory, and obtained the minimax
$L_2$ convergence rates for the entropy. Further, Jiao et al. [1] proposed a general methodology for estimating functionals, and
showed that for a wide class of functionals (including entropy and mutual information), their methodology yields minimax
rate-optimal estimators whose performance with $n$ samples is commensurate with that of the MLE with $n \ln n$ samples. They
also obtained minimax $L_2$ rates for estimating a large class of functionals. On the practical side, Jiao et al. [17] showed that the
minimax rate-optimal estimators introduced in [1] can lead to consistent and substantial performance boosts in various machine
learning applications. It was argued in [1] that the “only” approach that can achieve the minimax rates for entropy must either
implicitly or explicitly conduct best polynomial approximation as [1] did. A question that arises naturally then is whether
modifications of the plug-in approach, such as the Dirichlet prior smoothing ideas, can improve and result in something which
does achieve the minimax rates. This paper answers this question negatively.

Dirichlet smoothing may have two connotations in the context of entropy estimation: 

• [18], [19] One first obtains a Bayes estimate for the discrete distribution $P$, which we denote by $\hat{P}_B$, and then plugs it
into the entropy functional to obtain the entropy estimate $H(\hat{P}_B)$.

• [20], [21] One calculates the Bayes estimate for entropy $H(P)$ under the Dirichlet prior for squared error. The estimator is
the conditional expectation $\mathbb{E}[H(P)|\mathbf{X}]$, where $\mathbf{X}$ represents the samples.

We show in the present paper that neither approach results in improvements over the MLE in the large alphabet regime.
Specifically, these approaches require at least $n \gg S$ samples to be consistent, while the minimax rate-optimal estimators such as the
ones in [1], [2] only need $n \gg S/\ln S$ samples to achieve consistency.

A main motivation for the present paper, beyond that discussed above, is to demonstrate the power of approximation theory 
using positive linear operators for bounding the bias of plug-in estimators for functionals of parameters under arbitrary statistical 
models. It was first shown in Jiao et al. [3] that under mild conditions, the problem of bias analysis of plug-in estimators
for functionals from arbitrary finite dimensional statistical models is equivalent to approximation theory using positive linear
operators, a subfield of approximation theory which has been developing for more than a century. Applying advanced tools
from positive linear operator theory [22], Jiao et al. [3] obtained tight non-asymptotic characterizations of maximum $L_2$ risks
for the MLE in estimating a variety of functionals of probability distributions. In this paper, we contribute to the general positive
linear operator theory [22], and use the Dirichlet prior smoothed plug-in estimator as an example to demonstrate the efficacy
of this general theory in dealing with analysis of the bias in estimation problems. We believe this connection has far-reaching
implications beyond analyzing bias in statistical estimation, which itself is an important problem.


A. Dirichlet Smoothing 

The Dirichlet smoothing is widely used in practice to overcome the undersampling problem, i.e., one observes too few 
samples from a distribution $P$. The Dirichlet distribution of order $S \geq 2$ with parameters $\alpha_1, \ldots, \alpha_S > 0$ has a probability
density function with respect to Lebesgue measure on the Euclidean space $\mathbb{R}^{S-1}$ given by

$$f(x_1, \ldots, x_S; \alpha_1, \ldots, \alpha_S) = \frac{1}{B(\alpha)} \prod_{i=1}^{S} x_i^{\alpha_i - 1} \tag{2}$$

on the open $(S-1)$-dimensional simplex defined by:

$$x_1, \ldots, x_{S-1} > 0 \tag{3}$$

$$x_1 + \cdots + x_{S-1} < 1 \tag{4}$$

$$x_S = 1 - x_1 - \cdots - x_{S-1} \tag{5}$$



and zero elsewhere. The normalizing constant is the multinomial Beta function, which can be expressed in terms of the Gamma 
function: 


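$$B(\alpha) = \frac{\prod_{i=1}^{S} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{S} \alpha_i\right)}, \qquad \alpha = (\alpha_1, \ldots, \alpha_S). \tag{6}$$
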


Denote by $\mathcal{M}_S$ the set of discrete distributions supported on $S$ elements. Assuming the unknown distribution $P$ follows the prior
distribution $P \sim \mathrm{Dir}(\alpha)$, and we observe a vector $\mathbf{X} = (X_1, X_2, \ldots, X_S)$ with Multinomial distribution $\mathrm{Multi}(n; p_1, p_2, \ldots, p_S)$,
then one can show that the posterior distribution $P_{P|\mathbf{X}}$ is also a Dirichlet distribution with parameters

$$\alpha + \mathbf{X} = (\alpha_1 + X_1, \alpha_2 + X_2, \ldots, \alpha_S + X_S). \tag{7}$$



Furthermore, the posterior mean (conditional expectation) of $p_i$ given $\mathbf{X}$ is given by [23, Example 5.4.4]

$$\delta_i(\mathbf{X}) = \mathbb{E}[p_i|\mathbf{X}] = \frac{\alpha_i + X_i}{n + \sum_{i=1}^{S} \alpha_i}. \tag{8}$$

The estimator $\delta_i(\mathbf{X})$ is widely used in practice for various choices of $\alpha$. For example, if $\alpha_i = \sqrt{n}/S$, then the corresponding
$(\delta_1(\mathbf{X}), \delta_2(\mathbf{X}), \ldots, \delta_S(\mathbf{X}))$ is the minimax estimator for $P$ under squared loss [23, Example 5.4.5]. However, it is no longer
minimax under other loss functions such as $\ell_1$ loss, which was investigated in [24].

Note that the estimator $\delta_i(\mathbf{X})$ subsumes the MLE $\hat{p}_i = \frac{X_i}{n}$ as a special case, since we can take the limit $\alpha \to 0$ for $\delta_i(\mathbf{X})$
to obtain the MLE.

The Dirichlet prior smoothed distribution estimate is denoted as $\hat{P}_B$, where

$$\hat{P}_B = \frac{n}{n + \sum_{i=1}^{S} \alpha_i} P_n + \frac{\sum_{i=1}^{S} \alpha_i}{n + \sum_{i=1}^{S} \alpha_i} \cdot \frac{\alpha}{\sum_{i=1}^{S} \alpha_i}. \tag{9}$$

Note that the smoothed distribution $\hat{P}_B$ can be viewed as a convex combination of the empirical distribution $P_n$ and the prior
distribution $\alpha / \sum_{i=1}^{S} \alpha_i$. We call the estimator $H(\hat{P}_B)$ the Dirichlet prior smoothed plug-in estimator.
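The convex-combination view of the smoothed distribution can be checked numerically. The following is a minimal sketch, not code from the cited references: the counts, the choice $\alpha_i = 1/2$, and all function names are illustrative assumptions. It computes the posterior-mean estimate directly and as a mixture of the empirical distribution with the normalized prior, then plugs both the smoothed and the empirical distribution into the entropy functional.

```python
import math

def smoothed_distribution(counts, alpha):
    """Posterior mean under a Dirichlet(alpha) prior: (alpha_i + X_i) / (n + sum(alpha))."""
    n = sum(counts)
    total = n + sum(alpha)
    return [(a + x) / total for a, x in zip(alpha, counts)]

def convex_combination(counts, alpha):
    """The same estimate, written as a mixture of the empirical distribution
    and the normalized prior alpha / sum(alpha). Requires n > 0."""
    n = sum(counts)
    a_sum = sum(alpha)
    w = n / (n + a_sum)  # weight placed on the empirical distribution
    return [w * (x / n) + (1 - w) * (a / a_sum) for a, x in zip(alpha, counts)]

def entropy(p):
    """Shannon entropy in nats, with the convention 0 ln 0 = 0."""
    return -sum(q * math.log(q) for q in p if q > 0)

counts = [5, 3, 0, 2]            # hypothetical observed counts, n = 10
alpha = [0.5, 0.5, 0.5, 0.5]     # hypothetical symmetric prior parameters

p_b = smoothed_distribution(counts, alpha)
p_mix = convex_combination(counts, alpha)
h_plugin = entropy(p_b)                                   # smoothed plug-in estimate
h_mle = entropy([x / sum(counts) for x in counts])        # plain plug-in (MLE)
```

Because the prior here is symmetric, the mixture pulls the empirical distribution toward the uniform distribution, so in this example the smoothed plug-in estimate is larger than the MLE.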

Another way to apply the Dirichlet prior in entropy estimation is to compute the Bayes estimator for $H(P)$ under squared
error, given that $P$ follows the Dirichlet prior. It is well known that the Bayes estimator under squared error is the conditional
expectation. It was shown in Wolpert and Wolf [20] that

$$\hat{H}^{\mathrm{Bayes}} \triangleq \mathbb{E}[H(P)|\mathbf{X}] = \psi\left(\sum_{i=1}^{S} (\alpha_i + X_i) + 1\right) - \sum_{i=1}^{S} \frac{\alpha_i + X_i}{\sum_{j=1}^{S} (\alpha_j + X_j)} \psi(\alpha_i + X_i + 1), \tag{10}$$

where $\psi(z) = \frac{\Gamma'(z)}{\Gamma(z)}$ is the digamma function. We call the estimator $\hat{H}^{\mathrm{Bayes}}$ the Bayes estimator under Dirichlet prior.
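The Bayes estimator above only requires the digamma function. The sketch below is one hedged way to evaluate it with the Python standard library. The digamma routine (recurrence $\psi(x) = \psi(x+1) - 1/x$ followed by the standard asymptotic series) is our own implementation choice, and the counts and prior are illustrative assumptions rather than values from the paper.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift x upward with psi(x) = psi(x + 1) - 1/x,
    then apply the asymptotic series ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - ..."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 * (1.0 / 252 - inv2 / 240)))
    return result + math.log(x) - 0.5 / x - series

def bayes_entropy(counts, alpha):
    """Posterior mean of H(P) under a Dirichlet(alpha) prior, in nats."""
    post = [a + x for a, x in zip(alpha, counts)]
    total = sum(post)
    return digamma(total + 1) - sum(w / total * digamma(w + 1) for w in post)

counts = [5, 3, 0, 2]            # hypothetical observed counts
alpha = [0.5, 0.5, 0.5, 0.5]     # hypothetical symmetric prior parameters
h_bayes = bayes_entropy(counts, alpha)
```

A quick sanity check of the formula: with no observations and a uniform prior on a binary alphabet (`counts = [0, 0]`, `alpha = [1, 1]`), the estimator reduces to $\psi(3) - \psi(2) = 1/2$, which equals the average of the binary entropy (in nats) over $p \in [0,1]$.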


B. Non-asymptotic analysis of $L_2$ risk

We adopt the conventional statistical decision theoretic framework [25] in analyzing the performance of any entropy estimator
$\hat{H} = \hat{H}(\mathbf{X})$. Denote by $\mathcal{M}_S$ all discrete distributions with support size $S$. The $L_2$ risk of estimator $\hat{H}$ is defined as

$$\mathbb{E}_P \left(H(P) - \hat{H}\right)^2, \tag{11}$$

where the expectation is taken with respect to the distribution $P$ that generates the observations used by $\hat{H}$. Apparently, the $L_2$
risk is a function of both the unknown distribution $P$ and the estimator $\hat{H}$, and in order to obtain a single score that evaluates
how well the estimator $\hat{H}$ performs in the worst possible case, we may want to adopt the minimax criterion [25], [23], and evaluate the
maximum risk

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left(H(P) - \hat{H}\right)^2. \tag{12}$$

The $L_2$ risk can be decomposed into two parts: one is the bias, the other is the variance. They are defined by

$$\mathrm{Bias}(\hat{H}) = \mathbb{E}\hat{H} - H(P) \tag{13}$$

$$\mathrm{Var}(\hat{H}) = \mathbb{E}(\hat{H} - \mathbb{E}\hat{H})^2. \tag{14}$$

We have

$$\mathbb{E}_P \left(H(P) - \hat{H}\right)^2 = \left(\mathrm{Bias}(\hat{H})\right)^2 + \mathrm{Var}(\hat{H}). \tag{15}$$

The literature on concentration inequalities [26] provides us with effective techniques for controlling the variance, and we
will show it suffices to apply the Efron-Stein inequality to obtain tight bounds on $\mathrm{Var}(H(\hat{P}_B))$. The focus of this paper is on
bias analysis, rather than variance. How hard can it be to bound the bias?
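The bias-variance decomposition of the $L_2$ risk can be verified exactly in a small case. The sketch below is an illustration under our own assumptions: a binary alphabet ($S = 2$), $n = 50$, and $p = 0.3$ are arbitrary choices, and the function names are hypothetical. It enumerates the binomial distribution of the count, so no simulation is involved.

```python
import math

def binom_pmf(n, k, p):
    """Probability that a Binomial(n, p) variable equals k."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def binary_entropy(q):
    """Entropy (in nats) of the distribution (q, 1 - q), with 0 ln 0 = 0."""
    return -sum(t * math.log(t) for t in (q, 1 - q) if t > 0)

def risk_decomposition(n, p):
    """Exact bias, variance and L2 risk of the plug-in estimator H(P_n) for S = 2."""
    pmf = [binom_pmf(n, k, p) for k in range(n + 1)]
    estimates = [binary_entropy(k / n) for k in range(n + 1)]
    truth = binary_entropy(p)
    mean = sum(w * e for w, e in zip(pmf, estimates))
    bias = mean - truth
    var = sum(w * (e - mean) ** 2 for w, e in zip(pmf, estimates))
    mse = sum(w * (e - truth) ** 2 for w, e in zip(pmf, estimates))
    return bias, var, mse

bias, var, mse = risk_decomposition(50, 0.3)
```

The computed `mse` agrees with `bias**2 + var` up to floating-point error, and the bias is negative, as expected from Jensen's inequality for a plug-in estimate of a concave functional.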

The reader may be tempted to use the Taylor expansion to analyze the bias of $H(\hat{P}_B)$. Unfortunately, Taylor expansion
cannot give satisfactory results for this problem. To demonstrate this point simply, let us try to apply Taylor expansion to
analyze a special case of $H(\hat{P}_B)$, which is the MLE $H(P_n)$.

Considerable effort has been devoted to understanding the bias of the MLE $H(P_n)$. One of the earliest
investigations in this direction is due to Miller [27], who applied the Taylor expansion and showed that, for any fixed distribution
$P$,

$$\mathbb{E} H(P_n) = H(P) - \frac{S-1}{2n} + O\left(\frac{1}{n^2}\right). \tag{16}$$

Equation (16) was later refined by Harris [28] using higher order Taylor series expansions to yield

$$\mathbb{E} H(P_n) = H(P) - \frac{S-1}{2n} + \frac{1}{12n^2}\left(1 - \sum_{i=1}^{S} \frac{1}{p_i}\right) + O\left(\frac{1}{n^3}\right). \tag{17}$$

Harris's result reveals an undesirable consequence of the Taylor expansion method: one cannot obtain uniform bounds on
the bias of the MLE. Indeed, the term $\sum_{i=1}^{S} \frac{1}{p_i}$ can be arbitrarily large for distributions sufficiently close to the boundary of
the simplex of probabilities. However, it is evident that both $H(P_n)$ and $H(P)$ are bounded above by $\ln S$, since the entropy
of any distribution supported on $S$ elements is upper bounded by $\ln S$. Conceivably, for such a distribution $P$ that would make
$\sum_{i=1}^{S} \frac{1}{p_i}$ very large, we need to compute even higher order Taylor expansions to obtain more accuracy, but even with such
efforts we can never obtain a uniform bias bound for all $P$.

Harris's analysis demonstrates a severe limitation of the Taylor expansion method. The situation is even worse if we consider
general functionals. Indeed, if we want to analyze the bias of the plug-in estimator $F(P_n)$ for $F(P)$, where $F(P)$ is an arbitrary
functional of $P$, which may not be differentiable, Taylor expansion may not even be valid.
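For a fixed distribution well inside the simplex, the Taylor-expansion corrections of Miller and Harris can be compared with the exact bias of the MLE. The sketch below rests on our own assumptions: it uses the fact that each count is marginally binomial, takes the first-order term $-(S-1)/(2n)$ and the second-order term $\frac{1}{12n^2}(1 - \sum_i 1/p_i)$ as the standard forms of these corrections, and picks a uniform distribution on four symbols with $n = 200$ purely for illustration.

```python
import math

def log_binom_pmf(n, k, p):
    """Log of the Binomial(n, p) pmf at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def expected_plugin_entropy(n, probs):
    """Exact E[H(P_n)]: sum over symbols of E[f(X_i / n)] with f(x) = -x ln x
    and X_i ~ Binomial(n, p_i)."""
    total = 0.0
    for p in probs:
        if p <= 0.0 or p >= 1.0:
            continue  # X_i is then deterministic and f(0) = f(1) = 0
        for k in range(1, n):  # the k = 0 and k = n terms vanish
            q = k / n
            total += math.exp(log_binom_pmf(n, k, p)) * (-q * math.log(q))
    return total

n = 200
probs = [0.25, 0.25, 0.25, 0.25]  # illustrative distribution, S = 4
true_entropy = -sum(p * math.log(p) for p in probs)

exact_bias = expected_plugin_entropy(n, probs) - true_entropy
miller = -(len(probs) - 1) / (2 * n)
harris = miller + (1 - sum(1 / p for p in probs)) / (12 * n * n)
```

Here both approximations are accurate because every $p_i$ is far from zero. Moving one $p_i$ toward zero makes the $\sum_i 1/p_i$ term blow up while the exact bias stays bounded, which is the non-uniformity discussed above.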
C. Approximation theory using positive linear operators

An operator $L$ defined on a linear space of functions, $V$, is called linear if

$$L(\alpha f + \beta g) = \alpha L(f) + \beta L(g), \quad \text{for all } f, g \in V,\ \alpha, \beta \in \mathbb{R}, \tag{18}$$

and is called positive, if

$$L(f) \geq 0 \quad \text{for all } f \in V,\ f \geq 0. \tag{19}$$

It was first shown in Jiao et al. [3] that analyzing the bias of any plug-in estimator for functionals of parameters from any
parametric families can be recast as a problem of approximation theory using positive linear operators [22], which is a subarea
of approximation theory. Paltanea [22] provides a comprehensive account of the state-of-the-art theory in this subject.

For a concrete example, the classical Bernstein operator $B_n(f)$ maps a continuous function $f \in C[0,1]$ to another continuous
function $B_n(f) \in C[0,1]$ such that

$$B_n(f)(x) = \sum_{i=0}^{n} f\left(\frac{i}{n}\right) \binom{n}{i} x^i (1-x)^{n-i}, \tag{20}$$

where $C[0,1]$ denotes the space of continuous functions on $[0,1]$. One can easily verify that this operator is positive and linear.
Bernstein in 1912 showed that the function $B_n(f)$ uniformly approximates any continuous function $f \in C[0,1]$. In other words,

$$\sup_{x \in [0,1]} |B_n(f)(x) - f(x)| \to 0, \quad n \to \infty. \tag{21}$$

The function $B_n(f)(x)$ is called the Bernstein polynomial. On the other hand, the operator $B_n(f)$ has a probabilistic interpretation:
it can be interpreted as the expectation of the random variable $f\left(\frac{X}{n}\right)$, where $X$ follows a Binomial distribution $X \sim B(n, x)$. In
other words, the uniform approximation property (21) is equivalent to the fact that

$$\sup_{x \in [0,1]} \left|\mathbb{E}_x f\left(\frac{X}{n}\right) - f(x)\right| = \sup_{x \in [0,1]} \left|\mathrm{Bias}\left(f\left(\frac{X}{n}\right)\right)\right| \to 0, \quad n \to \infty, \tag{22}$$

where the expectation is taken with respect to the distribution $X \sim B(n, x)$. Hence, the bias of the plug-in estimator $f(X/n)$ is
precisely the approximation error of the Bernstein polynomial in approximating $f$. More generally, as Paltanea [22, Remark
1.1.2] argued using the Riesz-Markov-Kakutani representation theorem, positive linear operators of continuous functions can
be “essentially” interpreted as the expectation of some plug-in estimator. Specifically, let $\mathcal{X}$ be a locally compact Hausdorff
space, such as $\mathbb{R}^n$. For any positive linear functional $L$ on $C_c(\mathcal{X})$, which denotes the space of continuous compactly supported
real/complex valued functions on $\mathcal{X}$, the Riesz-Markov-Kakutani representation theorem implies that there is a unique regular
Borel measure $\mu$ on $\mathcal{X}$ such that

$$L(f) = \int_{\mathcal{X}} f(x)\, d\mu(x), \tag{23}$$

for all $f \in C_c(\mathcal{X})$. Hence, if we assume the functional $L$ maps the indicator function over $\mathcal{X}$ to the constant one, then the
measure $\mu$ is a probability measure, implying that $L(f)$ can be interpreted as the expectation of a plug-in estimator that plugs
an estimate of $x$ with distribution $\mu$ into the continuous function $f$.
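The identification of the Bernstein polynomial with the expectation of a plug-in estimator can be illustrated directly. The following is a small sketch under our own assumptions: the grid size and the values of $n$ are arbitrary, and the function names are hypothetical. It evaluates $B_n(f)(x) = \mathbb{E} f(X/n)$ for $f(x) = -x \ln x$ and records the largest approximation error on a grid, which is the largest absolute bias of the plug-in estimator $f(X/n)$ on that grid.

```python
import math

def bernstein(f, n, x):
    """B_n(f)(x) = sum_i f(i/n) C(n, i) x^i (1 - x)^(n - i),
    i.e. E f(X / n) for X ~ Binomial(n, x)."""
    return sum(f(i / n) * math.comb(n, i) * x**i * (1 - x) ** (n - i)
               for i in range(n + 1))

def neg_xlogx(x):
    """f(x) = -x ln x with the convention f(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0

def sup_error(f, n, grid=200):
    """Largest |B_n(f)(x) - f(x)| over an evenly spaced grid on [0, 1]."""
    return max(abs(bernstein(f, n, j / grid) - f(j / grid)) for j in range(grid + 1))

# Approximation error (= worst-case bias on the grid) for increasing n.
errors = {n: sup_error(neg_xlogx, n) for n in (10, 40, 160)}
```

As a correctness check of the operator itself, applying it to $f(x) = x^2$ gives $x^2 + x(1-x)/n$, the second moment of $X/n$, so the "bias" in that case is exactly the variance term $x(1-x)/n$.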

The rest of the paper is organized as follows. Section II discusses the main results of this paper. Section III develops
new results in approximation theory, and applies them to analyze the bias of the Dirichlet prior smoothed plug-in estimator.
Section IV presents a non-asymptotic upper bound on the variance of the Dirichlet prior smoothed plug-in estimator, and
Section V proves the lower bound on the maximum risk of the Dirichlet prior smoothed plug-in estimator as well as the Bayes
estimator under Dirichlet prior. The remaining proofs are deferred to the appendices.


II. Main Results 


For simplicity, we restrict attention to the case where the parameter $\alpha$ in the Dirichlet distribution takes the form $(a, a, \ldots, a)$.
We remark that for general $\alpha$ the analysis goes through seamlessly to arrive at similar conclusions.

In comparison to the MLE $H(P_n)$, where $P_n$ is the empirical distribution, the Dirichlet smoothing scheme $H(\hat{P}_B)$ has a
disadvantage: it requires the knowledge of the alphabet size $S$ in general. We define


$$\hat{p}_{B,i} = \frac{n\hat{p}_i + a}{n + Sa}, \tag{24}$$

and

$$p_{B,i} = \mathbb{E}[\hat{p}_{B,i}] = \frac{np_i + a}{n + Sa}. \tag{25}$$

Apparently,

$$\hat{P}_B = \frac{n}{n + Sa} P_n + \frac{Sa}{n + Sa} \cdot \text{Uniform distribution}, \tag{26}$$

where $P_n$ stands for the empirical distribution. Analogously, we have

$$P_B = \frac{n}{n + Sa} P + \frac{Sa}{n + Sa} \cdot \text{Uniform distribution}, \tag{27}$$


where $P$ is the true distribution. Throughout, we adopt the following standard notations: $a_n \lesssim b_n$ means $\sup_n a_n/b_n < \infty$,
$a_n \gtrsim b_n$ means $b_n \lesssim a_n$, and $a_n \asymp b_n$ means $a_n \lesssim b_n$ and $a_n \gtrsim b_n$, or equivalently, there exist two universal positive constants
$c, C$ such that

$$0 < c < \liminf_{n \to \infty} \frac{a_n}{b_n} \leq \limsup_{n \to \infty} \frac{a_n}{b_n} < C < \infty. \tag{28}$$



The main results of this paper can now be stated as follows. 

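Theorem 1. If $n \geq Sa$, then the maximum $L_2$ risk of $H(\hat{P}_B)$ in estimating $H(P)$ is upper bounded as

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left(H(\hat{P}_B) - H(P)\right)^2 \leq \left(\ln\left(1 + \frac{S-1}{n+Sa}\right) + \frac{2Sa}{n+Sa} \ln\left(\frac{n+Sa}{2a}\right)\right)^2 + \frac{2n}{(n+Sa)^2}\left[3 + \ln\left(\frac{n+Sa}{a+1} \wedge S\right)\right]^2, \tag{29}$$

where $a \wedge b = \min\{a, b\}$. Here the first term bounds the squared bias, and the second term bounds the variance.

The following corollary is immediate.

Corollary 1. If $n \gg S$ and $a$ is upper bounded by a constant, then the maximum $L_2$ risk of $H(\hat{P}_B)$ vanishes.

Theorem 2. If $n \geq \max\{15S, Sa\}$, then the maximum $L_2$ risk of $H(\hat{P}_B)$ in estimating $H(P)$ is lower bounded as

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left(H(\hat{P}_B) - H(P)\right)^2 \geq \frac{1}{2}\left[\left(\frac{(S-3)a}{4(n+Sa)} \ln\left(\frac{n+Sa}{a}\right) + \frac{S-1}{8n} + \frac{S^2}{80n^2} - \frac{1}{48n^2}\right)_+^2 + c\,\frac{\ln^2 S}{n}\right], \tag{30}$$

where $c > 0$ is a universal constant that does not depend on $\hat{P}_B$, $P$.
If $n < Sa$, then we have

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left(H(\hat{P}_B) - H(P)\right)^2 \geq \frac{\ln^2 S}{16}. \tag{31}$$

If $n < 15S$, then we have

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left(H(\hat{P}_B) - H(P)\right)^2 \geq \frac{1}{2}\left[\left(\frac{(S-3)a}{4(n+Sa)} \ln\left(\frac{n+Sa}{a}\right) + \frac{\lfloor n/15 \rfloor}{8n} - \frac{1}{16n}\right)_+^2 + c\,\frac{\ln^2 S}{n}\right], \tag{32}$$

where $\lfloor x \rfloor$ is the largest integer that does not exceed $x$, and $(x)_+ = \max\{x, 0\}$ represents the positive part of $x$.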

We have the following corollary. 

Corollary 2. If $n \lesssim S$, then the maximum $L_2$ risk of $H(\hat{P}_B)$ is bounded away from zero.

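The case $a = 0$ corresponds to the analysis of the MLE, which was conducted in [3]. For the MLE, it was shown in [3] that
we have, for any $n$,

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left(H(P_n) - H(P)\right)^2 \leq \left(\ln\left(1 + \frac{S-1}{n}\right)\right)^2 + \frac{(\ln n)^2}{n} \wedge \frac{2(\ln S + 2)^2}{n}, \tag{33}$$

and if $n \geq 15S$,

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left(H(P_n) - H(P)\right)^2 \geq \frac{1}{2}\left(\frac{S-1}{2n} + \frac{S^2}{20n^2} - \frac{1}{12n^2}\right)^2 + c\,\frac{\ln^2 S}{n}, \tag{34}$$

where $c > 0$ is the same universal constant as in Theorem 2.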

The next theorem presents a lower bound on the maximum risk of the Bayes estimator under Dirichlet prior. Since we have
assumed that $\alpha_i = a$, $1 \leq i \leq S$, the Bayes estimator under Dirichlet prior is

$$\hat{H}^{\mathrm{Bayes}} = \psi(Sa + n + 1) - \sum_{i=1}^{S} \frac{a + X_i}{Sa + n}\psi(a + X_i + 1). \tag{35}$$

Theorem 3. If $S \geq e(2n+1)$ and $n \geq Sa$, then

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left(\hat{H}^{\mathrm{Bayes}} - H(P)\right)^2 \geq \left(\ln\left(\frac{S}{e(2n+1)}\right)\right)^2. \tag{36}$$

If $n < Sa$, then

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left(\hat{H}^{\mathrm{Bayes}} - H(P)\right)^2 \geq \left[\left(\ln\left(\frac{Sa + n}{e(a + n + 1)}\right)\right)_+\right]^2. \tag{37}$$



Evident from Theorems 1, 2, and 3 is the fact that in the best situation (i.e., $a$ not too large), both the Dirichlet prior smoothed
plug-in estimator and the Bayes estimator under Dirichlet prior still require at least $n \gg S$ samples to be consistent, which
is the same as the MLE. In contrast, the minimax rate-optimal estimator in Jiao et al. [1] is consistent if $n \gg S/\ln S$, which is the
best possible rate for consistency. Thus, we can conclude that the Dirichlet smoothing technique does not solve the entropy
estimation problem. From an intuitive point of view it is also clear: both the Dirichlet prior smoothed plug-in estimator and
the Bayes estimator under Dirichlet prior do not exploit the special properties of the entropy functional $p \ln(1/p)$, i.e., the
functional has a nondifferentiable point at $p = 0$. The analysis in [1] demonstrates that the minimax rate-optimal estimator has
to exploit the special structure of the entropy function.
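The failure of consistency when the alphabet size is comparable to the sample size can be observed with an exact computation. The sketch below is illustrative only: the choices $S = n$ and $a = 1$, the two test distributions, and the function names are our assumptions. It computes the exact bias of the Dirichlet prior smoothed plug-in estimator, using the fact that each count $X_i$ is marginally $\mathrm{Binomial}(n, p_i)$.

```python
import math

def neg_xlogx(x):
    """f(x) = -x ln x with the convention f(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0

def binom_pmf(n, k, p):
    """Binomial(n, p) pmf at k, computed in log-space for numerical stability."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_w = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_w)

def smoothed_plugin_bias(n, probs, a):
    """Exact E[H(P_B_hat)] - H(P) for the symmetric prior (a, ..., a)."""
    S = len(probs)
    expected = sum(binom_pmf(n, k, p) * neg_xlogx((k + a) / (n + S * a))
                   for p in probs for k in range(n + 1))
    return expected - sum(neg_xlogx(p) for p in probs)

a = 1.0  # illustrative smoothing parameter
results = {}
for n in (50, 100, 200):
    S = n  # alphabet size comparable to the number of samples
    uniform = [1.0 / S] * S
    point_mass = [1.0] + [0.0] * (S - 1)
    results[n] = (smoothed_plugin_bias(n, uniform, a),
                  smoothed_plugin_bias(n, point_mass, a))
```

In this regime the bias under the point mass does not shrink as $n$ grows, since the smoothing spreads a constant fraction of the mass over the unseen symbols. The bias under the uniform distribution is smaller in magnitude but negative, so no single value of $a$ obviously removes both effects in this example.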


III. Approximation theory for bias analysis 

Denote by $e_j$, $j \in \mathbb{N}_+ \cup \{0\}$, the monomial functions $e_j(y) = y^j$, $y \in I$. For a linear positive functional $F$, we adopt the
following notation

$$B_F(x) = |F(e_1) - xF(e_0)|, \qquad V_F = F\left((e_1 - F(e_1)e_0)^2\right), \tag{38}$$

which represent the “bias” and “variance” of a positive linear functional $F$. Define the first order and second order Ditzian-Totik
modulus of smoothness [29] by

$$\omega_1^\varphi(f, 2h) = \sup\left\{|f(u) - f(v)| : u, v \in [0,1],\ |u - v| \leq 2h\varphi\left(\frac{u+v}{2}\right)\right\} \tag{39}$$

$$\omega_2^\varphi(f, h) = \sup\left\{\left|f(u) - 2f\left(\frac{u+v}{2}\right) + f(v)\right| : u, v \in [0,1],\ |u - v| \leq 2h\varphi\left(\frac{u+v}{2}\right)\right\}. \tag{40}$$


A. A contribution to approximation theory 

First we recall the following result, which is a direct corollary of [22, Thm. 2.5.1].

Lemma 1. If $F : C[0,1] \to \mathbb{R}$ is a linear positive functional and $F(e_0) = 1$, then we have

$$|F(f) - f(x)| \leq \frac{B_F(x)}{2h_1\varphi(x)} \cdot \omega_1^\varphi(f, 2h_1) + \frac{5}{2}\omega_2^\varphi(f, h_1), \tag{41}$$

for all $f \in C[0,1]$ and $0 < h_1 \leq \frac{1}{2}$, where $\varphi(x) = \sqrt{x(1-x)}$ and $h_1 = \sqrt{F((e_1 - xe_0)^2)}/\varphi(x) = \sqrt{V_F + (B_F(x))^2}/\varphi(x)$.

We remark that Lemma 1 cannot yield the desired result for $f(p) = -p \ln p$ and

$$F(f) = \sum_{k=0}^{n} f\left(\frac{k+a}{n+Sa}\right) \binom{n}{k} p^k (1-p)^{n-k}. \tag{42}$$

Specifically, it is easy to show that

$$B_F(p) = \left|\frac{np + a}{n + Sa} - p\right| = \frac{|1 - pS|a}{n + Sa}, \qquad V_F = \frac{np(1-p)}{(n+Sa)^2}. \tag{43}$$


One can also easily show that $\omega_1^\varphi(f, 2h) \asymp h$ for $f(x) = -x \ln x$. Regarding the second order Ditzian-Totik modulus of
smoothness, the following lemma from [3] gives a precise characterization.

Lemma 2. For $f(x) = -x \ln x$, $x \in [0,1]$, the second order Ditzian-Totik modulus of smoothness $\omega_2^\varphi(f, h)$ satisfies

$$\omega_2^\varphi(f, h) = \frac{h^2 \ln 4}{1 + h^2}, \qquad h \leq 1. \tag{44}$$

Hence, when $x \to 0$, we conclude that

$$h_1 = \frac{\sqrt{V_F + (B_F(x))^2}}{\varphi(x)} \geq \frac{|B_F(x)|}{\varphi(x)} = \frac{a}{n + Sa} \cdot \frac{|1 - xS|}{\sqrt{x(1-x)}} \tag{45}$$

is unbounded as $x \to 0$, and thus does not satisfy the condition $h_1 \leq 1/2$. Thus, we cannot directly use the result of Lemma 1.
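The blow-up of $h_1$ near the boundary, in contrast with the bounded quantity $\sqrt{V_F}/\varphi(x)$, is easy to see numerically. The following sketch assumes the closed forms $B_F(p) = |1 - pS|a/(n+Sa)$ and $V_F = np(1-p)/(n+Sa)^2$ for the smoothed binomial functional. The parameter values and function names are our own illustrative choices.

```python
import math

def bias_and_variance(n, S, a, p):
    """B_F(p) and V_F for F(f) = E f((X + a) / (n + S a)), X ~ Binomial(n, p)."""
    b = abs(1 - p * S) * a / (n + S * a)
    v = n * p * (1 - p) / (n + S * a) ** 2
    return b, v

def step_sizes(n, S, a, p):
    """h1 = sqrt(V_F + B_F^2) / phi(p) and h2 = sqrt(V_F) / phi(p),
    with phi(p) = sqrt(p (1 - p))."""
    b, v = bias_and_variance(n, S, a, p)
    phi = math.sqrt(p * (1 - p))
    return math.sqrt(v + b * b) / phi, math.sqrt(v) / phi

n, S, a = 100, 10, 1.0  # illustrative values
# Each row is (p, h1, h2) for p approaching the boundary of [0, 1].
table = [(p, *step_sizes(n, S, a, p)) for p in (1e-1, 1e-2, 1e-4, 1e-6)]
```

The second step size stays equal to $\sqrt{n}/(n+Sa)$ for every $p$, while the first one exceeds $1/2$ once $p$ is small enough. This is the motivation for replacing $h_1$ with a quantity that does not involve the bias term.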

It turns out that the general result in Lemma 1 can be strictly improved in a general fashion. The result is given by the
following lemma.
Lemma 3. If $F : C[0,1] \to \mathbb{R}$ is a linear positive functional and $F(e_0) = 1$, then

$$|F(f) - f(x)| \leq \omega_1(f, B_F(x); x) + \frac{5}{2}\omega_2^\varphi(f, h_2) \tag{46}$$

for all $f \in C[0,1]$ and $0 < h_2 \leq \frac{1}{2}$, where $\varphi(x) = \sqrt{x(1-x)}$ and $h_2 = \sqrt{V_F}/\varphi(x)$, and

$$\omega_1(f, h; x) = \sup\{|f(u) - f(x)| : u \in [0,1],\ |u - x| \leq h\}. \tag{47}$$

Proof: Applying Lemma 1 to $x = F(e_1)$ we have

$$|F(f) - f(F(e_1))| \leq \frac{5}{2}\omega_2^\varphi(f, h_2), \tag{48}$$

and then (46) is the direct result of the triangle inequality $|F(f) - f(x)| \leq |F(f) - f(F(e_1))| + |f(F(e_1)) - f(x)|$. ■

We show that Lemma 3 is indeed stronger than Lemma 1. Firstly, due to $h_1 \geq h_2$, we have $\omega_2^\varphi(f, h_2) \leq \omega_2^\varphi(f, h_1)$. Second,
for $x \leq 1/2$, we have

$$\frac{B_F(x)}{2h_1\varphi(x)} \cdot \omega_1^\varphi(f, 2h_1) \approx \frac{B_F(x)}{2h_1\varphi(x)} \cdot \sup_{0 < s < 1} 2h_1\varphi(s)|f'(s)| \tag{49}$$

$$\geq B_F(x) \cdot \sup_{x \leq s \leq 1-x} |f'(s)| \tag{50}$$

$$\approx \sup_{x \leq s \leq 1-x} \omega_1(f, B_F(x); s), \tag{51}$$

which is almost the supremum of $\omega_1(f, |F(e_1 - xe_0)|; s)$ over $s \in [x, 1-x]$ and is no less than the pointwise result
$\omega_1(f, |F(e_1 - xe_0)|; x)$, and here we have used the inequality $\varphi(s) \geq \varphi(x)$ for $x \leq s \leq 1-x$. A similar argument also holds for $x > 1/2$.
Hence, Lemma 3 transforms the first order term from the norm result in Lemma 1 to a pointwise result, which may exhibit
great advantages when applied to specific problems.


B. Application of the improved general bound to our problem 
Theorem 4. If $n \geq \max\{Sa, 4\}$, then

$$\sup_{P \in \mathcal{M}_S} \left|\mathbb{E}_P H(\hat{P}_B) - H(P)\right| \leq \frac{5nS \ln 2}{(n+Sa)^2} + \frac{2Sa}{n+Sa}\ln\left(\frac{n+Sa}{2a}\right). \tag{52}$$



Note that Theorem 4 implies a slightly weaker bias bound than Theorem 1, but it is only sub-optimal up to a multiplicative
constant. The bias bound in Theorem 1 is obtained using another technique which is tailored for the entropy function, and is
presented in the appendix.

Now we give the proof of Theorem 4 using the approximation theoretic machinery we just established. Note that $h_2 = \frac{\sqrt{n}}{n+Sa}$.
In order to ensure that $h_2 \leq 1/2$, it suffices to take $n \geq 4$.

In light of Lemma 3, we have

$$\left|\mathbb{E}_P H(\hat{P}_B) - H(P)\right| \leq \sum_{i=1}^{S} \left(\omega_1\left(f, \frac{|1 - p_iS|a}{n+Sa}; p_i\right) + \frac{5n \ln 2}{(n+Sa)^2}\right) \tag{53}$$

$$\leq \sum_{i=1}^{S} -\frac{|1 - p_iS|a}{n+Sa} \ln\left(\frac{|1 - p_iS|a}{n+Sa}\right) + \frac{5nS \ln 2}{(n+Sa)^2} \tag{54}$$

$$\leq \frac{2Sa}{n+Sa}\ln\left(\frac{n+Sa}{2a}\right) + \frac{5nS \ln 2}{(n+Sa)^2}, \tag{55}$$

where we have used the fact that if $|x - y| \leq 1/2$, $x, y \in [0,1]$, then $|x \ln x - y \ln y| \leq -|x - y| \ln |x - y|$. The readers are
referred to the proof of Cover and Thomas [30, Thm. 17.3.3] for details. We also utilized the fact that if $n \geq Sa$, then for any
$i$, $1 \leq i \leq S$,

$$\frac{|1 - p_iS|a}{n+Sa} \leq \frac{Sa}{n+Sa} \leq \frac{1}{2}. \tag{56}$$



IV. Variance Analysis 

The following theorem gives the variance bound in Theorem 1.

Theorem 5. The variance of $H(\hat{P}_B)$ is upper bounded as follows:

$$\mathrm{Var}\left(H(\hat{P}_B)\right) \leq \frac{2n}{(n+Sa)^2}\left[3 + \ln\left(\frac{n+Sa}{a+1} \wedge S\right)\right]^2. \tag{57}$$

Proof: We recall the following bounded differences inequality first.

Lemma 4. [26, Cor. 3.2] If a function $f : \mathcal{Z}^n \to \mathbb{R}$ has the bounded differences property, i.e., for some non-negative constants
$c_1, c_2, \ldots, c_n$,

$$\sup_{z_1, z_2, \ldots, z_n, z_i' \in \mathcal{Z}} |f(z_1, \ldots, z_n) - f(z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_n)| \leq c_i, \quad 1 \leq i \leq n, \tag{58}$$

then

$$\mathrm{Var}(f(Z_1, Z_2, \ldots, Z_n)) \leq \frac{1}{4}\sum_{i=1}^{n} c_i^2, \tag{59}$$

when $Z_1, Z_2, \ldots, Z_n$ are independent random variables.

In our case, apparently $H(\hat{P}_B)$ is a function of $n$ independent random variables $\{Z_i\}_{1 \leq i \leq n}$ taking values in $\mathcal{Z} = \{1, 2, \ldots, S\}$.
Changing one location of the sample would make some symbol with count $j$ have count $j + 1$, and another symbol with
count $i$ have count $i - 1$. Then the absolute value of the total change in the functional estimator is

$$\left|f\left(\frac{j+1+a}{n+Sa}\right) - f\left(\frac{j+a}{n+Sa}\right) - f\left(\frac{i+a}{n+Sa}\right) + f\left(\frac{i-1+a}{n+Sa}\right)\right| \leq 2\max_{1 \leq k \leq n}\left|f\left(\frac{k+a}{n+Sa}\right) - f\left(\frac{k-1+a}{n+Sa}\right)\right|. \tag{60}$$

In light of the Taylor expansion with integral form residue, we have that for $1 \geq x \geq t > 0$,

$$(x - t)\ln(x - t) = x \ln x - t(\ln x + 1) + \int_{x}^{x-t} \frac{x - t - u}{u}\, du, \tag{61}$$

so

$$|(x - t)\ln(x - t) - x \ln x| \leq t|\ln x + 1| + \int_{x-t}^{x} \frac{x - t}{u}\, du + t \leq t|\ln x + 1| + 2t \leq t(3 - \ln x). \tag{62}$$

As a result,

$$\max_{1 \leq k \leq n}\left|f\left(\frac{k+a}{n+Sa}\right) - f\left(\frac{k-1+a}{n+Sa}\right)\right| \leq \max_{1 \leq k \leq n} \frac{1}{n+Sa}\left(3 - \ln\frac{k+a}{n+Sa}\right) \tag{63}$$

$$\leq \frac{1}{n+Sa}\left(3 + \ln\frac{n+Sa}{a+1}\right). \tag{64}$$

Hence, the bounded differences inequality shows that

$$\mathrm{Var}\left(H(\hat{P}_B)\right) \leq n \max_{1 \leq k \leq n}\left|f\left(\frac{k+a}{n+Sa}\right) - f\left(\frac{k-1+a}{n+Sa}\right)\right|^2 \leq \frac{n}{(n+Sa)^2}\left(3 + \ln\frac{n+Sa}{a+1}\right)^2, \tag{65}$$

which completes the proof of the first part.

For the second inequality, the Efron-Stein inequality gives a general upper bound on the variance.

Lemma 5. [26, Thm. 3.1] Let $Z_1, \ldots, Z_n$ be independent random variables and let $f(Z_1, Z_2, \ldots, Z_n)$ be a square integrable
function. Moreover, if $(Z_1', Z_2', \ldots, Z_n')$ are independent copies of $(Z_1, Z_2, \ldots, Z_n)$ and if we define, for every $i = 1, 2, \ldots, n$,

$$f_i' = f(Z_1, Z_2, \ldots, Z_{i-1}, Z_i', Z_{i+1}, \ldots, Z_n), \tag{66}$$

then

$$\mathrm{Var}(f) \leq \frac{1}{2}\sum_{i=1}^{n} \mathbb{E}(f - f_i')^2. \tag{67}$$

Since $H(\hat{P}_B) = \hat{H}_B(Z_1, \ldots, Z_n)$ is invariant to any permutation of $(Z_1, Z_2, \ldots, Z_n)$, we know that the Efron-Stein
inequality implies

$$\mathrm{Var}\left(H(\hat{P}_B)\right) \leq \frac{n}{2}\mathbb{E}\left(\hat{H}_B(Z_1', Z_2, \ldots, Z_n) - \hat{H}_B(Z_1, Z_2, \ldots, Z_n)\right)^2, \tag{68}$$

where $Z_1'$ is an i.i.d. copy of $Z_1$.

Now define

$$X_i = \sum_{j=1}^{n} \mathbb{1}(Z_j = i), \quad 1 \leq i \leq S. \tag{69}$$
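The increment bound used in the first part of the proof, namely $|(x-t)\ln(x-t) - x\ln x| \leq t(3 - \ln x)$ for $0 < t \leq x \leq 1$ and its consequence for the smoothed counts, can be spot-checked numerically. The following sketch is a check on a finite grid, not a proof. The grid size, the values of $n$, $S$, $a$, and the function names are our own illustrative assumptions.

```python
import math

def g(y):
    """g(y) = y ln y with the convention g(0) = 0."""
    return y * math.log(y) if y > 0 else 0.0

def worst_gap(grid=200):
    """Largest value of |g(x - t) - g(x)| - t (3 - ln x) over grid points
    with 0 < t <= x <= 1. A non-positive result means the bound held everywhere."""
    worst = float("-inf")
    for i in range(1, grid + 1):
        x = i / grid
        for j in range(1, i + 1):
            t = j / grid
            worst = max(worst, abs(g(x - t) - g(x)) - t * (3 - math.log(x)))
    return worst

def max_increment(n, S, a):
    """max over 1 <= k <= n of |f((k + a)/N) - f((k - 1 + a)/N)|,
    where f(x) = -x ln x and N = n + S a."""
    N = n + S * a
    return max(abs(g((k + a) / N) - g((k - 1 + a) / N)) for k in range(1, n + 1))

n, S, a = 100, 20, 0.5  # illustrative values
N = n + S * a
increment_bound = (3 + math.log(N / (a + 1))) / N
largest_increment = max_increment(n, S, a)
```

In this example the largest single-sample increment is well below the bound, which reflects the slack introduced by the constant 3 in the inequality.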

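For brevity, we denote the $S$-tuple $(X_1, \ldots, X_S)$ as $X_1^S$, and the $n$-tuple $(Z_1, \ldots, Z_n)$ as $Z_1^n$. A specific realization of
$(X_1, \ldots, X_S)$ is denoted by $x_1^S = (x_1, \ldots, x_S)$, and a specific realization of $(Z_1, \ldots, Z_n)$ is denoted by $z_1^n = (z_1, \ldots, z_n)$.
Then we have

$$\mathbb{E}\left(\hat{H}_B(Z_1', Z_2, \ldots, Z_n) - \hat{H}_B(Z_1, Z_2, \ldots, Z_n)\right)^2 \tag{70}$$

$$= \sum_{x_1^S} \mathbb{P}(X_1^S = x_1^S)\, \mathbb{E}\left[\left(\hat{H}_B(Z_1', Z_2, \ldots, Z_n) - \hat{H}_B(Z_1, Z_2, \ldots, Z_n)\right)^2 \,\Big|\, X_1^S = x_1^S\right]. \tag{71}$$

In light of [3, Lemma B.1], we know that the conditional distribution of $Z_1$ conditioned on $(X_1, \ldots, X_S)$ is the discrete
distribution $(X_1/n, X_2/n, \ldots, X_S/n)$. Denoting $r(p) = f\left(\frac{np+a}{n+Sa}\right)$, we can rewrite

$$\hat{H}_B(Z_1', Z_2, \ldots, Z_n) - \hat{H}_B(Z_1, Z_2, \ldots, Z_n) = D_- + D_+, \tag{72}$$

where

$$D_- = r\left(\frac{X_{Z_1} - 1}{n}\right) - r\left(\frac{X_{Z_1}}{n}\right), \tag{73}$$

$$D_+ = \begin{cases} r\left(\frac{X_{Z_1'} + 1}{n}\right) - r\left(\frac{X_{Z_1'}}{n}\right), & Z_1' \neq Z_1 \\ r\left(\frac{X_{Z_1'}}{n}\right) - r\left(\frac{X_{Z_1'} - 1}{n}\right), & Z_1' = Z_1. \end{cases} \tag{74}$$

Here, $D_-$ is the change in $\hat{H}_B$ that occurs when $Z_1$ is removed according to the distribution $(X_1/n, X_2/n, \ldots, X_S/n)$,
and $D_+$ is the change in $\hat{H}_B$ that occurs when $Z_1'$ is added back according to the true distribution $P$. Now we have

$$\mathbb{E}[D_-^2 | X_1^S] = \sum_{i=1}^{S} \frac{X_i}{n}\left(r\left(\frac{X_i - 1}{n}\right) - r\left(\frac{X_i}{n}\right)\right)^2, \tag{75}$$

$$\mathbb{E}[D_+^2 | X_1^S] = \sum_{i=1}^{S} \frac{X_i}{n} p_i \left(r\left(\frac{X_i}{n}\right) - r\left(\frac{X_i - 1}{n}\right)\right)^2 + \sum_{i=1}^{S} \left(1 - \frac{X_i}{n}\right) p_i \left(r\left(\frac{X_i + 1}{n}\right) - r\left(\frac{X_i}{n}\right)\right)^2, \tag{76}$$

where we define $r(x) = 0$ when $x \notin [0,1]$. Then, by the law of iterated expectation, we know that

$$\mathbb{E}[D_-^2] = \sum_{i=1}^{S}\sum_{j=0}^{n} \frac{j}{n}\left(r\left(\frac{j-1}{n}\right) - r\left(\frac{j}{n}\right)\right)^2 \times \mathbb{P}(B(n, p_i) = j), \tag{77}$$

$$\mathbb{E}[D_+^2] = \sum_{i=1}^{S}\sum_{j=0}^{n} \left[\frac{j}{n}\left(r\left(\frac{j}{n}\right) - r\left(\frac{j-1}{n}\right)\right)^2 + \left(1 - \frac{j}{n}\right)\left(r\left(\frac{j+1}{n}\right) - r\left(\frac{j}{n}\right)\right)^2\right] \times p_i\, \mathbb{P}(B(n, p_i) = j). \tag{78}$$

After some algebra we can show that $\mathbb{E}[D_-^2] = \mathbb{E}[D_+^2]$. Hence, we have

$$\mathrm{Var}\left(H(\hat{P}_B)\right) \leq \frac{n}{2}\mathbb{E}(D_- + D_+)^2 \leq n\,\mathbb{E}(D_-^2 + D_+^2) = 2n\,\mathbb{E}D_-^2 \tag{79}$$

$$= 2n\sum_{i=1}^{S} \mathbb{E}\, P_n(i)\left(r\left(P_n(i) - \frac{1}{n}\right) - r(P_n(i))\right)^2 \tag{80}$$

$$\leq 2n\sum_{i=1}^{S} \mathbb{E}\, P_n(i)\left(\frac{1}{n+Sa}\left(3 - \ln\frac{nP_n(i) + a}{n+Sa}\right)\right)^2 \tag{81}$$

$$= \frac{2n}{(n+Sa)^2}\sum_{i=1}^{S} \mathbb{E}\, P_n(i)\left(3 - \ln\frac{nP_n(i) + a}{n+Sa}\right)^2 \tag{82}$$

$$\leq \frac{2n}{(n+Sa)^2}\sum_{i=1}^{S} p_i\left(3 - \ln\frac{np_i + a}{n+Sa}\right)^2 \tag{83}$$

$$\leq \frac{2n}{(n+Sa)^2}\sum_{i=1}^{S} S^{-1}\left(3 - \ln\frac{n/S + a}{n+Sa}\right)^2 \tag{84}$$

$$= \frac{2n(3 + \ln S)^2}{(n+Sa)^2}, \tag{85}$$

where we have used the inequality (62) and Jensen's inequality due to

$$\frac{d^2}{dx^2}\left[x\left(3 - \ln\frac{nx+a}{n+Sa}\right)^2\right] = \frac{2n}{nx+a}\left[2\left(\ln\left(\frac{nx+a}{n+Sa}\right) - 3\right) + \frac{nx}{nx+a}\left(4 - \ln\left(\frac{nx+a}{n+Sa}\right)\right)\right] \leq 0. \tag{86}$$
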


V. Proof of the lower bound 
Theorem [2] follows upon applying the inequality 


sup Ep (h(Pb) — H(P)\ >max{(7,.D} 

Pc A/f c v / 


> 


C + D 


PeMs v ' 2 

where expressions C and D are given, respectively, in the following two theorems. 
Theorem 6. There exists a universal constant c > 0 such that 


PeM s 


Theorem 7. If n> max] 155, ,S'«}, 


sup 

PeMs 


E pH(P b ) - H(P) 


sup E p (h(P b )-H(P)\ 

■f= A/f o V / 

(5 — 3)a 


> c- 


ln 5 


> 


■ In 


+ Sa\ 5-1 


If n < Sa, then 

If n < 155, then 


sup 

PeM s 


sup 

PeMs 


EpH(P B ) - H(P) 


4 (n + Sa) 

E p H(P b ) - H(P) 
(5 - 3 )a 


1 


8 n 80 n 2 48n 2 


> 


In 5 


1 


^ v _ . n + Sa\ 1_tt./15J 

4(n + 5a) \ a ) 8n 16n 


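Proof: By setting $P = (1, 0, 0, \ldots, 0)$, we have $H(P) = 0$ and
\begin{align}
H(\hat{P}_B) = \frac{n+a}{n+Sa} \ln\left( \frac{n+Sa}{n+a} \right) + \frac{(S-1)a}{n+Sa} \ln\left( \frac{n+Sa}{a} \right) \ge \frac{(S-1)a}{n+Sa} \ln\left( \frac{n+Sa}{a} \right), \tag{92}
\end{align}
hence we have obtained the first lower bound
\begin{align}
\sup_{P \in \mathcal{M}_S} \left| \mathbb{E}_P H(\hat{P}_B) - H(P) \right| \ge \frac{(S-1)a}{n+Sa} \ln\left( \frac{n+Sa}{a} \right). \tag{93}
\end{align}
Note that if $n < Sa$, then
\begin{align}
\sup_{P \in \mathcal{M}_S} \left| \mathbb{E}_P H(\hat{P}_B) - H(P) \right| \ge \frac{(S-1)a}{n+Sa} \ln S \ge \frac{(S-1)a}{2Sa} \ln S \ge \frac{\ln S}{4}. \tag{94}
\end{align}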


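From now on we assume $n \ge Sa$. For $n \ge 15S$, it follows from [2, Lemma 2.11] that
\begin{align}
\sup_{P \in \mathcal{M}_S} \left| \mathbb{E}_P H(P_n) - H(P) \right| \ge \frac{S-1}{2n} + \frac{S^2}{20n^2} - \frac{1}{12n^2}. \tag{95}
\end{align}
If $n < 15S$, then it follows from the proof of [2, Lemma 2.11] that one can essentially take $S = \lfloor n/15 \rfloor$ in (95) and obtain
\begin{align}
\sup_{P \in \mathcal{M}_S} \left| \mathbb{E}_P H(P_n) - H(P) \right| \ge \frac{\lfloor n/15 \rfloor}{2n} - \frac{1}{4n}. \tag{96}
\end{align}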


It follows from a refinement result of Cover and Thomas [30, Thm. 17.3.3] that when $|\hat{p}_{B,i} - P_n(i)| \le 1/2$ for all $i$ (which is ensured by the condition $n \ge Sa$), we have
\begin{align}
\left| H(\hat{P}_B) - H(P_n) \right| \le \frac{2Sa}{n+Sa} \ln\left( \frac{n+Sa}{2a} \right). \tag{97}
\end{align}
A combination of these two inequalities yields the second lower bound
\begin{align}
\sup_{P \in \mathcal{M}_S} \left| \mathbb{E}_P H(\hat{P}_B) - H(P) \right| \ge \frac{S-1}{2n} + \frac{S^2}{20n^2} - \frac{1}{12n^2} - \frac{2Sa}{n+Sa} \ln\left( \frac{n+Sa}{2a} \right) \tag{98}
\end{align}
when $n \ge 15S$, and the second lower bound
\begin{align}
\sup_{P \in \mathcal{M}_S} \left| \mathbb{E}_P H(\hat{P}_B) - H(P) \right| \ge \frac{\lfloor n/15 \rfloor}{2n} - \frac{1}{4n} - \frac{2Sa}{n+Sa} \ln\left( \frac{n+Sa}{2a} \right) \tag{99}
\end{align}
when $n < 15S$.

Hence we are done by using these two lower bounds and the inequality $\max\{a, b\} \ge \frac{3a + b}{4}$. ■
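The first lower bound can be reproduced exactly, because under $P = (1, 0, \ldots, 0)$ all $n$ samples equal the first symbol and $H(\hat{P}_B)$ is deterministic. The following is a minimal sketch under that observation, assuming $a > 0$ and the smoothed estimate $\hat{p}_{B,i} = (X_i + a)/(n + Sa)$; the function names are hypothetical.

```python
import math

def smoothed_plugin_entropy(counts, a):
    """Plug-in entropy (in nats) of the smoothed distribution (X_i + a) / (n + S a)."""
    n, S = sum(counts), len(counts)
    probs = [(x + a) / (n + S * a) for x in counts]
    return -sum(p * math.log(p) for p in probs)

def point_mass_bias(n, S, a):
    """H(P_B_hat) - H(P) for P = (1, 0, ..., 0), where H(P) = 0."""
    return smoothed_plugin_entropy([n] + [0] * (S - 1), a)

def first_lower_bound(n, S, a):
    """(S - 1) a / (n + S a) * ln((n + S a) / a), the bound from the point mass."""
    return (S - 1) * a / (n + S * a) * math.log((n + S * a) / a)

# Example with n < S a: the bias at the point mass is at least (ln S) / 4.
n, S, a = 10, 100, 1.0
print(point_mass_bias(n, S, a), first_lower_bound(n, S, a), math.log(S) / 4)
```

The gap between `point_mass_bias` and `first_lower_bound` is exactly the nonnegative term $\frac{n+a}{n+Sa}\ln\frac{n+Sa}{n+a}$ that the proof discards.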

We prove Theorem 3 below.

Proof: We first assume that $n \ge Sa$. Applying the recursive formula $\psi(x+1) = \psi(x) + 1/x$, we know that $\psi(a + X_i + 1) \ge \psi(a + 1)$, since $X_i$ is a non-negative integer. Hence, we have
\begin{align}
\hat{H}^{\mathrm{Bayes}} \le \psi(Sa + n + 1) - \sum_{i=1}^{S} \frac{a + X_i}{Sa + n}\, \psi(a + 1) = \psi(Sa + n + 1) - \psi(a + 1). \tag{100}
\end{align}
We know that the digamma function $\psi(x)$ is strictly increasing from $-\infty$ to $\infty$ on $(0, \infty)$, and that for any $x \ge 1$,
\begin{align}
\ln \lfloor x \rfloor - \gamma \le \psi(x) \le 1 + \ln \lfloor x \rfloor - \gamma, \tag{101}
\end{align}
where $\gamma \approx 0.57721$ is the Euler-Mascheroni constant.

Hence, using $\psi(a+1) \ge \psi(1) = -\gamma$, we know
\begin{align}
\hat{H}^{\mathrm{Bayes}} \le 1 - \gamma + \ln(Sa + n + 1) - (-\gamma) = 1 + \ln(Sa + n + 1) \le 1 + \ln(2n + 1), \tag{102}
\end{align}
where we used the assumption that $n \ge Sa$. Since the Bayes estimator under Dirichlet prior is upper bounded by $1 + \ln(2n+1)$ for all possible realizations, the squared error it incurs for the uniform distribution is at least $(\ln S - 1 - \ln(2n+1))^2$ when $S \ge e(2n+1)$. The first part is proved.
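The upper bound (102) can be checked numerically. The sketch below is not taken from the paper: it assumes the standard closed form $\hat{H}^{\mathrm{Bayes}} = \psi(Sa+n+1) - \sum_i \frac{a+X_i}{Sa+n}\psi(a+X_i+1)$ for the posterior mean of entropy under a symmetric Dirichlet prior, and uses a small hand-written digamma approximation (recurrence plus asymptotic series) so that only the standard library is needed. All function names are hypothetical.

```python
import math
import random

def digamma(x):
    """Approximate psi(x) for x > 0: shift x up with psi(x) = psi(x+1) - 1/x,
    then apply the asymptotic series ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def bayes_entropy(counts, a):
    """Posterior mean of the entropy under a symmetric Dirichlet(a) prior (assumed closed form)."""
    n, S = sum(counts), len(counts)
    total = S * a + n
    return digamma(total + 1.0) - sum((a + x) / total * digamma(a + x + 1.0) for x in counts)

def sample_counts(n, S, rng):
    """Counts of n i.i.d. draws from the uniform distribution on S symbols."""
    counts = [0] * S
    for _ in range(n):
        counts[rng.randrange(S)] += 1
    return counts

# Example with n >= S a and S >= e (2n + 1): the estimate never reaches ln S.
n, S = 5, 40
a = n / S
rng = random.Random(0)
cap = 1.0 + math.log(2 * n + 1)
print(bayes_entropy(sample_counts(n, S, rng), a), cap, math.log(S))
```

In this regime every realization satisfies $\hat{H}^{\mathrm{Bayes}} \le 1 + \ln(2n+1) < \ln S$, which is what forces the squared error at the uniform distribution to be at least $(\ln S - 1 - \ln(2n+1))^2$.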

Regarding the second part, we assume $n < Sa$. Take $P = (1, 0, \ldots, 0) \in \mathcal{M}_S$. Thus,
\begin{align}
\hat{H}^{\mathrm{Bayes}} = \psi(Sa + n + 1) - \frac{(S-1)a}{Sa + n}\, \psi(a + 1) - \frac{a + n}{Sa + n}\, \psi(a + n + 1), \quad \text{a.s.} \tag{103}
\end{align}
and the corresponding true entropy $H(P) = 0$.

Using the monotonicity of $\psi(x)$ on $(0, \infty)$, we have
\begin{align}
\hat{H}^{\mathrm{Bayes}} &\ge \psi(Sa + n + 1) - \psi(a + n + 1) \tag{104}\\
&\ge \ln(Sa + n) - \gamma - (1 + \ln(a + n + 1) - \gamma) \tag{105}\\
&= \ln(Sa + n) - 1 - \ln(a + n + 1) \tag{106}\\
&= \ln\left( \frac{Sa + n}{e(a + n + 1)} \right). \tag{107}
\end{align}
Hence, the squared error incurred by the Bayes estimator under Dirichlet prior for $P = (1, 0, \ldots, 0) \in \mathcal{M}_S$ is at least $\left( \ln\left( \frac{Sa+n}{e(a+n+1)} \right) \right)_+^2$. The second part is proved.
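The chain (103) to (107) can be evaluated directly for a point mass, where the counts are $(n, 0, \ldots, 0)$ almost surely. This is a rough sketch with hypothetical names; the digamma helper is a standard-library approximation (recurrence plus asymptotic series), not the paper's construction.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 via psi(x) = psi(x+1) - 1/x and an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def bayes_entropy_point_mass(n, S, a):
    """Right-hand side of (103): the Bayes estimate when all n samples hit one symbol."""
    total = S * a + n
    return (digamma(total + 1.0)
            - (S - 1) * a / total * digamma(a + 1.0)
            - (a + n) / total * digamma(a + n + 1.0))

def point_mass_lower_bound(n, S, a):
    """Right-hand side of (107): ln((S a + n) / (e (a + n + 1)))."""
    return math.log((S * a + n) / (math.e * (a + n + 1.0)))

# Example with n < S a: the estimate is far from the true entropy H(P) = 0.
n, S, a = 10, 1000, 1.0
print(bayes_entropy_point_mass(n, S, a), point_mass_lower_bound(n, S, a))
```

When $S a$ is much larger than $n$, the bound grows like $\ln(Sa/n)$, so the squared error at the point mass cannot vanish.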
Appendix A

Proof of the bias analysis in Theorem 1
The following theorem gives the upper bound on the squared bias used in Theorem 1.

Theorem 8. If $n \ge Sa$,
\begin{align}
\sup_{P \in \mathcal{M}_S} \left| \mathbb{E}_P H(\hat{P}_B) - H(P) \right| \le \ln\left( 1 + \frac{S-1}{n+Sa} \right) + \frac{2Sa}{n+Sa} \ln\left( \frac{n+Sa}{2a} \right). \tag{108}
\end{align}

Proof: We have


\begin{align}
H(\hat{P}_B) &= \sum_{i=1}^{S} -\hat{p}_{B,i} \ln \hat{p}_{B,i} \nonumber\\
&= H(P_B) + \sum_{i=1}^{S} (p_{B,i} - \hat{p}_{B,i}) \ln p_{B,i} - \sum_{i=1}^{S} \hat{p}_{B,i} \ln \frac{\hat{p}_{B,i}}{p_{B,i}}. \tag{109}
\end{align}
Taking expectations on both sides, we have
\begin{align}
\mathbb{E} H(\hat{P}_B) - H(P) = H(P_B) - H(P) - \mathbb{E} D(\hat{P}_B \| P_B), \tag{110}
\end{align}
where $D(P \| Q) = \sum_{i=1}^{S} p_i \ln \frac{p_i}{q_i}$ is the KL divergence between distributions $P$ and $Q$. In order to analyze the bias, it suffices to analyze the two terms separately. We first analyze $\mathbb{E} D(\hat{P}_B \| P_B)$.
It follows from Jensen's inequality that
\begin{align}
D(P \| Q) = \sum_{i=1}^{S} p_i \ln \frac{p_i}{q_i} \le \ln\left( \sum_{i=1}^{S} \frac{p_i^2}{q_i} \right), \tag{111}
\end{align}
whose derivation here follows from Tsybakov [31, Lemma 2.7].
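By Jensen's inequality, we have
\begin{align}
\mathbb{E} D(\hat{P}_B \| P_B) \le \mathbb{E} \ln\left( \sum_{i=1}^{S} \frac{\hat{p}_{B,i}^2}{p_{B,i}} \right) \le \ln\left( \mathbb{E} \sum_{i=1}^{S} \frac{\hat{p}_{B,i}^2}{p_{B,i}} \right). \tag{112}
\end{align}
We also have
\begin{align}
\sum_{i=1}^{S} \frac{p_i^2}{q_i} = 1 + \sum_{i=1}^{S} \frac{(p_i - q_i)^2}{q_i}, \tag{113}
\end{align}
and that
\begin{align}
\mathbb{E}(\hat{p}_{B,i} - p_{B,i})^2 = \frac{n^2}{(n+Sa)^2}\, \mathbb{E}(P_n(i) - p_i)^2 = \frac{n p_i (1 - p_i)}{(n+Sa)^2}. \tag{114}
\end{align}
Hence,
\begin{align}
\ln\left( \mathbb{E} \sum_{i=1}^{S} \frac{\hat{p}_{B,i}^2}{p_{B,i}} \right) = \ln\left( 1 + \sum_{i=1}^{S} \frac{n p_i (1 - p_i)}{(n+Sa)\, n p_i} \cdot \frac{n p_i}{n p_i + a} \right) \le \ln\left( 1 + \sum_{i=1}^{S} \frac{1 - p_i}{n+Sa} \right), \tag{115}
\end{align}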

which implies that
\begin{align}
\mathbb{E} D(\hat{P}_B \| P_B) \le \ln\left( 1 + \frac{S-1}{n+Sa} \right) \le \frac{S-1}{n+Sa}. \tag{116}
\end{align}
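For a binary alphabet the expectation in (116) is a finite sum over a binomial count, so the bound can be checked exactly. The sketch below is an illustration restricted to $S = 2$ and $a > 0$, with hypothetical function names; it also exposes the pointwise inequality (111).

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def log_chi_square_bound(p, q):
    """Right-hand side of (111): ln(sum_i p_i^2 / q_i)."""
    return math.log(sum(pi * pi / qi for pi, qi in zip(p, q)))

def expected_kl_binary(p1, n, a):
    """Exact E D(P_B_hat || P_B) for S = 2, summing over the Binomial(n, p1) count."""
    S = 2
    denom = n + S * a
    p_b = [(n * p1 + a) / denom, (n * (1.0 - p1) + a) / denom]
    total = 0.0
    for j in range(n + 1):
        weight = math.comb(n, j) * p1**j * (1.0 - p1) ** (n - j)
        p_hat = [(j + a) / denom, (n - j + a) / denom]
        total += weight * kl(p_hat, p_b)
    return total

# Compare the exact expectation with ln(1 + (S - 1)/(n + S a)) for S = 2.
n, a = 20, 0.5
print(expected_kl_binary(0.3, n, a), math.log(1.0 + 1.0 / (n + 2 * a)))
```

Larger alphabets would need a sum over multinomial counts or Monte Carlo sampling; the binary case is used here only because it keeps the check exact.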

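Now we consider the deterministic gap $H(P_B) - H(P)$. It follows from a refinement result of Cover and Thomas [30, Thm. 17.3.3] that when $|p_{B,i} - p_i| \le 1/2$ for all $i$, we have
\begin{align}
|H(P_B) - H(P)| \le \|P_B - P\|_1 \ln \frac{S}{\|P_B - P\|_1}. \tag{117}
\end{align}
Note that the condition $n \ge Sa$ ensures that $|p_{B,i} - p_i| \le 1/2$. We compute
\begin{align}
\|P_B - P\|_1 = \sum_{i=1}^{S} \left| \frac{n p_i + a}{n + Sa} - p_i \right| = \frac{Sa}{n+Sa} \sum_{i=1}^{S} \left| p_i - \frac{1}{S} \right|. \tag{118}
\end{align}
Taking a naive bound $\sum_{i=1}^{S} |p_i - \frac{1}{S}| \le 2$, we have
\begin{align}
\|P_B - P\|_1 \le \frac{2Sa}{n+Sa}. \tag{119}
\end{align}
Hence,
\begin{align}
|H(P_B) - H(P)| \le \frac{2Sa}{n+Sa} \ln\left( \frac{n+Sa}{2a} \right). \tag{120}
\end{align}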
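The identity (118) and the resulting gap bound (120) are deterministic, so they can be checked on a few fixed distributions. The sketch below assumes $p_{B,i} = (n p_i + a)/(n + Sa)$, $n \ge Sa$ and $S \ge 3$; helper names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats, with 0 ln 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def smooth(p, n, a):
    """Deterministic smoothed distribution P_B with entries (n p_i + a) / (n + S a)."""
    S = len(p)
    return [(n * x + a) / (n + S * a) for x in p]

def l1_distance(p, q):
    """L1 distance between two distributions on the same alphabet."""
    return sum(abs(x - y) for x, y in zip(p, q))

def gap_bound(n, S, a):
    """Right-hand side of (120): (2 S a / (n + S a)) ln((n + S a) / (2 a))."""
    return 2.0 * S * a / (n + S * a) * math.log((n + S * a) / (2.0 * a))

# Example: a point mass on S = 5 symbols with n = 100, a = 1.
n, a = 100, 1.0
P = [1.0, 0.0, 0.0, 0.0, 0.0]
P_B = smooth(P, n, a)
print(l1_distance(P_B, P), abs(entropy(P_B) - entropy(P)), gap_bound(n, len(P), a))
```

Combining `gap_bound` with the bound $\ln(1 + \frac{S-1}{n+Sa})$ on the expected KL term gives the right-hand side of Theorem 8.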